Add performance regression testing workflow #332
Conversation
Would you mind adding the sample run of this workflow to the description?
on:
  pull_request:
    branches: [ master ]
We might want to trigger the workflow only when the source code of ion-c
has been changed.
on:
  pull_request:
    paths:
      - 'ionc/*'
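For reference, combined with the existing pull_request trigger, that might look something like this (a sketch; keep or drop the branch filter as appropriate):

```yaml
on:
  pull_request:
    branches: [ master ]
    paths:
      - 'ionc/*'
```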
Good call, adding.
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema01.isl testData/realWorldDataSchema01.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema02.isl testData/realWorldDataSchema02.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/realWorldDataSchema03.isl testData/realWorldDataSchema03.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedList.isl testData/nestedList.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/nestedStruct.isl testData/nestedStruct.10n
java -jar $jar_file generate -S 200000 --input-ion-schema $schema_dir/sexp.isl testData/sexp.10n
Nitpick: We might prefer to use a for loop to avoid repeated logic.
for test in nestedStruct nestedList sexp realWorldDataSchema01 realWorldDataSchema02 realWorldDataSchema03
do
java -jar $jar_file generate -S ${{env.data_size}} --input-ion-schema $schema_dir/${test}.isl testData/${test}.10n
done
Yeah.. I had originally broken it out because what I had taken from either ion-python's or ion-java's workflow had -S 100, which resulted in some wildly different dataset sizes. I think sexp.10n was 195 bytes, which definitely wasn't large enough to overcome the normal runtime noise (a 5% diff ended up being ~195ns on my system). Once I saw what -S does and realized what was going on, I didn't re-loop it. I'll tidy that up.
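(A quick, illustrative way to sanity-check the generated sizes, assuming GNU coreutils on the runner; this is not part of the workflow:)

```bash
for test in nestedStruct nestedList sexp realWorldDataSchema01 realWorldDataSchema02 realWorldDataSchema03
do
  # Each file should come out near the 200000-byte target passed via -S.
  stat -c '%n: %s bytes' testData/${test}.10n
done
```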
[[ -e /usr/bin/cmake ]] || ln -s `which cmake3` /usr/bin/cmake

- name: Get Data Generator
  uses: actions/checkout@v2
Could we pin to a specific commit SHA instead of a tag? Tags can be updated and re-pointed, so the code referenced by a tag can change, whereas a commit SHA represents the exact snapshot of the code at a point in time.
The latest version of actions/checkout is recommended: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
Same suggestion for the other actions/checkout calls.
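A minimal sketch of the pinned form for the step above (any existing with: inputs would stay unchanged):

```yaml
- name: Get Data Generator
  uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
```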
env:
  compare: candidate/tools/ion-bench/deps/google-benchmark/tools/compare.py
  cli_path: candidate/build/profiling/tools/ion-bench/src/IonCBench
  alpha: 0.03
I'd like to know: can the alpha value be adjusted based on the accuracy of the regression-detection workflow?
Absolutely, alpha can be adjusted to fine-tune the null-hypothesis test; that is why it is exposed, so we can tune it as needed. However, right now the results of the null-hypothesis test aren't being used, because the p-value calculated between revisions almost always flags the benchmarks as having statistically significant differences.
I plan to spend a little more time looking into the test to see if it is appropriate for these benchmarks.
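For reference, this is roughly how alpha feeds into the comparison, assuming the workflow calls google-benchmark's compare.py directly and dumps the comparison JSON with -d (the baseline/candidate file names here are illustrative):

```bash
# --alpha sets the significance level for compare.py's Mann-Whitney U test;
# -d writes the comparison output (including the utest results) to results.json.
python3 "${compare}" --alpha "${alpha}" -d results.json \
  benchmarks baseline.json candidate.json
```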
Thanks for the explanation.
@linlin-s, I'm not sure what you mean. The workflow ran as part of the checks, so all of the output is available there. The full output is a bit much for the PR description; do you mean just the output of the
Apologies for the confusion. What I meant by a sample run is already included in the checks: we can track the current workflow run there, since it is triggered on each pull request.
threshold_perc: 5
run: |
  echo "Printing Results"
  RESULTS=$(cat results.json | jq '.[] | select(.run_type == "aggregate" and (.name | endswith("_mean"))) | {name:.name,cpu_time_perc_diff:(.measurements[0].cpu*100)}|select(.cpu_time_perc_diff > '"${threshold_perc}"')')
Approving this PR, with one question: when cpu_time_perc_diff > 0, it means the candidate's CPU run time is larger than the baseline's CPU run time, right? And if the value exceeds the threshold, the workflow fails with a regression detected.
Yes, exactly. The way compare.py generates its output, we end up with data like this:
{
  "name": "realWorldDataSchema03.10n_mean",
  "label": "",
  "measurements": [
    {
      "real_time": 12452149.371052643,
      "cpu_time": 12451057.894736838,
      "real_time_other": 12473477.750893278,
      "cpu_time_other": 12470339.46428573,
      "time": 0.0017128271758622316,
      "cpu": 0.0015485888598303227
    }
  ],
  "time_unit": "ns",
  "run_type": "aggregate",
  "aggregate_name": "mean",
  "utest": {}
},
This represents the baseline's (cpu_time) and candidate's (cpu_time_other) mean (of 20 runs) wallclock and CPU times for the realWorldDataSchema03 test. The cpu field is the relative difference ((candidate - baseline) / baseline) between the candidate and the baseline; likewise, the time field is the relative wallclock difference. Since the relative difference is a fraction of the baseline, I multiply it by 100 so we can configure the threshold with whole percentage values. Any positive value means the candidate's time was larger than the baseline's.
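To make that concrete with the numbers above: (12470339.46 - 12451057.89) / 12451057.89 ≈ 0.00155, which is the cpu value in the sample; multiplying by 100 gives a cpu_time_perc_diff of roughly 0.15%, well under the 5% threshold.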
The check step passes the JSON output through that jq query. The JSON output from compare.py contains both the raw results from the candidate run and the comparisons to the baseline, as mentioned above. The jq query looks for the aggregate comparisons of the means and gathers only the results that exceed the threshold. If there are none, the RESULTS variable will be an empty string; otherwise it will contain JSON with the name and CPU %diff for each failing run. The step then checks whether RESULTS is empty when deciding whether it should fail. If it's not empty, it also uses jq to print the name and %diff for each test that exceeded the threshold.
Hah.. typing this up made me realize I missed the aggregate_name field. I could make the jq query a little friendlier; I'll follow up on that in a later PR.
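A rough sketch of that check step, folded together with the aggregate_name tweak just mentioned (variable names as in the workflow excerpt; the exact script may differ):

```bash
# Select mean aggregates whose CPU %diff exceeds the threshold.
RESULTS=$(jq '.[]
  | select(.run_type == "aggregate" and .aggregate_name == "mean")
  | {name: .name, cpu_time_perc_diff: (.measurements[0].cpu * 100)}
  | select(.cpu_time_perc_diff > '"${threshold_perc}"')' results.json)

# Fail the step if any benchmark regressed past the threshold.
if [ -n "$RESULTS" ]; then
  echo "CPU time regressions detected:"
  echo "$RESULTS" | jq -r '"\(.name): +\(.cpu_time_perc_diff)%"'
  exit 1
fi
```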
Thanks for the detailed explanation. Sounds good to me.
Issue #, if available: n/a
Description of changes:
This PR adds a GHA workflow that builds both the HEAD of the target branch and the PR's revision, and runs the ion-bench tool to generate benchmark results for a set of 200KiB binary Ion datasets.
This PR adds only the read (deserialize_all) benchmark to the workflow. An update to add the write (serialize_all) benchmark will follow.
The workflow uses google-benchmark's compare.py, which performs a null-hypothesis check against the data to determine whether the results from the two benchmark runs are statistically similar. This test is not yet used for validation, primarily because I need to understand the tuning of the alpha parameter, as well as what to expect from the test in a noisy environment like GHA.
Example Run: here
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.